1 Introduction
Build a model that can filter user comments based on the degree of language maliciousness:
- Preprocess the text by removing the tokens that make no significant contribution at the semantic level.
- Transform the text corpus into sequences.
- Build a Deep Learning model including recurrent layers for a multilabel classification task.
- At prediction time, the model should return a vector containing a 1 or a 0 for each label in the dataset (toxic, severe_toxic, obscene, threat, insult, identity_hate). In this way, a non-harmful comment is classified by a vector of only 0s: [0,0,0,0,0,0]. In contrast, a dangerous comment exhibits at least one 1 among the 6 labels.
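As a minimal sketch of this target behaviour, a vector of per-label sigmoid outputs can be binarized into the 0/1 label vector described above (the 0.5 threshold here is an assumption; a tuned threshold is used later in the post):

```python
import numpy as np

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# hypothetical sigmoid outputs for one comment
probs = np.array([0.91, 0.12, 0.67, 0.03, 0.55, 0.08])

# binarize with an assumed 0.5 threshold
pred = (probs >= 0.5).astype(int)

# a comment is harmful if at least one of the 6 labels is 1
is_harmful = bool(pred.any())
```

A clean comment would instead produce all probabilities below the threshold and an all-zero vector.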
2 Setup
Leveraging Quarto and RStudio, I will set up an R and Python environment.
2.1 Import R libraries
Import the R libraries, which will be used both for rendering the document and for data analysis; the reason is that I prefer ggplot2 over matplotlib. I will also use colorblind-safe palettes.
2.2 Import Python packages
Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
import keras_nlp
from keras.backend import clear_session
from keras.models import Model, load_model
from keras.layers import TextVectorization, Input, Dense, Embedding, Dropout, GlobalAveragePooling1D, LSTM, Bidirectional, GlobalMaxPool1D, Flatten, Attention
from keras.metrics import Precision, Recall, AUC, SensitivityAtSpecificity, SpecificityAtSensitivity, F1Score
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import multilabel_confusion_matrix, classification_report, ConfusionMatrixDisplay, precision_recall_curve, f1_score, recall_score, roc_auc_score
2.3 Class Config
I created a Config class to store all the useful parameters for the model and for the project, which improves readability.
Code
class Config():
    def __init__(self):
        self.url = "https://s3.eu-west-3.amazonaws.com/profession.ai/datasets/Filter_Toxic_Comments_dataset.csv"
        self.max_tokens = 20000
        self.output_sequence_length = 911  # check the analysis done to establish this value
        self.embedding_dim = 128
        self.batch_size = 32
        self.epochs = 100
        self.temp_split = 0.3
        self.test_split = 0.5
        self.random_state = 42
        self.total_samples = 159571  # total train samples
        self.train_samples = 111699
        self.val_samples = 23936
        self.features = 'comment_text'
        self.labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
        self.new_labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', "clean"]
        self.label_mapping = {label: i for i, label in enumerate(self.labels)}
        self.new_label_mapping = {label: i for i, label in enumerate(self.new_labels)}
        self.path = "/Users/simonebrazzi/R/blog/posts/toxic_comment_filter/history/f1score/"
        self.model = self.path + "model_f1.keras"
        self.checkpoint = self.path + "checkpoint.lstm_model_f1.keras"
        self.history = self.path + "lstm_model_f1.xlsx"
        self.metrics = [
            Precision(name='precision'),
            Recall(name='recall'),
            AUC(name='auc', multi_label=True, num_labels=len(self.labels)),
            F1Score(name="f1", average="macro")
        ]

    def get_early_stopping(self):
        early_stopping = keras.callbacks.EarlyStopping(
            monitor="val_f1",  # "val_recall"
            min_delta=0.2,
            patience=10,
            verbose=0,
            mode="max",
            restore_best_weights=True,
            start_from_epoch=3
        )
        return early_stopping

    def get_model_checkpoint(self, filepath):
        model_checkpoint = keras.callbacks.ModelCheckpoint(
            filepath=filepath,
            monitor="val_f1",  # "val_recall"
            verbose=0,
            save_best_only=True,
            save_weights_only=False,
            mode="max",
            save_freq="epoch"
        )
        return model_checkpoint

    def find_optimal_threshold_cv(self, ytrue, yproba, metric, thresholds=np.arange(.05, .35, .05), n_splits=7):
        # instantiate KFold
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
        threshold_scores = []
        for threshold in thresholds:
            cv_scores = []
            for train_index, val_index in kf.split(ytrue):
                ytrue_val = ytrue[val_index]
                yproba_val = yproba[val_index]
                ypred_val = (yproba_val >= threshold).astype(int)
                score = metric(ytrue_val, ypred_val, average="macro")
                cv_scores.append(score)
            mean_score = np.mean(cv_scores)
            threshold_scores.append((threshold, mean_score))
        # find the threshold with the highest mean score
        best_threshold, best_score = max(threshold_scores, key=lambda x: x[1])
        return best_threshold, best_score

config = Config()
3 Data
The dataset is accessible using tf.keras.utils.get_file to fetch the file from its URL. N.B. For reproducibility purposes, I also downloaded the dataset, since there were times when the link was not available.
Code
library(reticulate)
py$df %>%
tibble() %>%
head(5) %>%
gt() %>%
tab_header(
title = "First five observations"
) %>%
cols_align(
align = "center",
columns = c("toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate", "sum_injurious")
) %>%
cols_align(
align = "left",
columns = comment_text
) %>%
cols_label(
comment_text = "Comments",
toxic = "Toxic",
severe_toxic = "Severe Toxic",
obscene = "Obscene",
threat = "Threat",
insult = "Insult",
identity_hate = "Identity Hate",
sum_injurious = "Sum Injurious"
)
| First five observations | |||||||
|---|---|---|---|---|---|---|---|
| Comments | Toxic | Severe Toxic | Obscene | Threat | Insult | Identity Hate | Sum Injurious |
| Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC) | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info. | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| " More I can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ""types of accidents"" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know. There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport " | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| You, sir, are my hero. Any chance you remember what page that's on? | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Let's create a clean variable for EDA purposes: I want to see visually how many observations are clean versus the other labels.
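A sketch of how those helper columns can be derived with pandas (on toy data; sum_injurious and clean match the columns shown in the tables below):

```python
import pandas as pd

labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# toy stand-in for the real dataframe
df = pd.DataFrame({
    "comment_text": ["thanks for the help", "you are an idiot"],
    "toxic": [0, 1], "severe_toxic": [0, 0], "obscene": [0, 0],
    "threat": [0, 0], "insult": [0, 1], "identity_hate": [0, 0],
})

# number of harmful labels per comment, and a clean flag when there are none
df["sum_injurious"] = df[labels].sum(axis=1)
df["clean"] = (df["sum_injurious"] == 0).astype(int)
```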
3.1 EDA
First, check the dataset for possible missing values and imbalances.
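The missing-value part of that check is a one-liner in pandas; a sketch on toy data (the real df comes from the CSV above):

```python
import pandas as pd

# toy stand-in dataframe with no missing values
df = pd.DataFrame({"comment_text": ["hi there", "you fool"], "toxic": [0, 1]})

# count NaNs per column; a complete dataset sums to zero
missing_per_column = df.isna().sum()
total_missing = int(missing_per_column.sum())
```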
3.1.1 Frequency
Code
library(reticulate)
df_r <- py$df
new_labels_r <- py$config$new_labels
df_r_grouped <- df_r %>%
select(all_of(new_labels_r)) %>%
pivot_longer(
cols = all_of(new_labels_r),
names_to = "label",
values_to = "value"
) %>%
group_by(label) %>%
summarise(count = sum(value)) %>%
mutate(freq = round(count / sum(count), 4))
df_r_grouped %>%
gt() %>%
tab_header(
title = "Labels frequency",
subtitle = "Absolute and relative frequency"
) %>%
fmt_number(
columns = "count",
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = TRUE,
use_seps = TRUE
) %>%
fmt_percent(
columns = "freq",
decimals = 2,
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = FALSE
) %>%
cols_align(
align = "center",
columns = c("count", "freq")
) %>%
cols_align(
align = "left",
columns = label
) %>%
cols_label(
label = "Label",
count = "Absolute Frequency",
freq = "Relative frequency"
)
| Labels frequency | ||
|---|---|---|
| Absolute and relative frequency | ||
| Label | Absolute Frequency | Relative frequency |
| clean | 143,346 | 80.33% |
| identity_hate | 1,405 | 0.79% |
| insult | 7,877 | 4.41% |
| obscene | 8,449 | 4.73% |
| severe_toxic | 1,595 | 0.89% |
| threat | 478 | 0.27% |
| toxic | 15,294 | 8.57% |
3.1.2 Barchart
Code
library(reticulate)
barchart <- df_r_grouped %>%
ggplot(aes(x = reorder(label, count), y = count, fill = label)) +
geom_col() +
labs(
x = "Labels",
y = "Count"
) +
# sort bars in descending order
scale_x_discrete(limits = df_r_grouped$label[order(df_r_grouped$count, decreasing = TRUE)]) +
scale_fill_brewer(type = "seq", palette = "RdYlBu") +
theme_minimal()
ggplotly(barchart)
It is clearly visible how imbalanced the dataset is. This suggests computing class weights and using the class_weight argument during training.
It is clear that most of the comments are clean: they account for 80.33% of the label occurrences, while the six toxic labels together account for only the remaining 19.67%.
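Given the imbalance, per-label weights can be derived directly from the frequency table above. A hedged sketch using the negative-to-positive ratio (one common weighting convention, not necessarily the exact form Keras expects for this multilabel setup):

```python
# absolute frequencies from the table above
total = 159571
counts = {
    "toxic": 15294, "severe_toxic": 1595, "obscene": 8449,
    "threat": 478, "insult": 7877, "identity_hate": 1405,
}

# weight each label's positives by the negative/positive ratio:
# rare labels (threat) get much larger weights than common ones (toxic)
pos_weight = {label: (total - n) / n for label, n in counts.items()}
```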
3.2 Sequence length definition
To convert the text into a useful input for a neural network, it is necessary to use a TextVectorization layer; see Section 4.
One of its arguments is output_sequence_length: to choose a sensible value, it is useful to analyze the text lengths. To simulate what the model will do, we remove the punctuation and the new lines from the comments.
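What output_sequence_length does can be shown in plain Python: every tokenized comment is forced to the same fixed length (this is an illustration of the pad-or-truncate behaviour, not the Keras layer itself):

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Force a token-id sequence to a fixed length, padding with pad_id."""
    return token_ids[:length] + [pad_id] * max(0, length - len(token_ids))

short_seq = pad_or_truncate([7, 3, 9], 5)         # padded with zeros
long_seq = pad_or_truncate([1, 2, 3, 4, 5, 6], 5) # truncated
```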
3.2.1 Summary
Code
library(reticulate)
df_r %>%
mutate(
comment_text_clean = comment_text %>%
tolower() %>%
str_remove_all("[[:punct:]]") %>%
str_replace_all("\n", " "),
text_length = comment_text_clean %>% str_count()
) %>%
pull(text_length) %>%
summary() %>%
as.list() %>%
as_tibble() %>%
gt() %>%
tab_header(
title = "Summary Statistics",
subtitle = "of text length"
) %>%
fmt_number(
drop_trailing_zeros = TRUE,
drop_trailing_dec_mark = TRUE,
use_seps = TRUE
) %>%
cols_align(
align = "center",
) %>%
cols_label(
Min. = "Min",
`1st Qu.` = "Q1",
Median = "Median",
`3rd Qu.` = "Q3",
Max. = "Max"
)
| Summary Statistics | |||||
|---|---|---|---|---|---|
| of text length | |||||
| Min | Q1 | Median | Mean | Q3 | Max |
| 4 | 91 | 196 | 378.4 | 419 | 5,000 |
3.2.2 Boxplot
Code
library(reticulate)
boxplot <- df_r %>%
mutate(
comment_text_clean = comment_text %>%
tolower() %>%
str_remove_all("[[:punct:]]") %>%
str_replace_all("\n", " "),
text_length = comment_text_clean %>% str_count()
) %>%
# pull(text_length) %>%
ggplot(aes(y = text_length)) +
geom_boxplot() +
theme_minimal()
ggplotly(boxplot)
3.2.3 Histogram
Code
library(reticulate)
df_ <- df_r %>%
mutate(
comment_text_clean = comment_text %>%
tolower() %>%
str_remove_all("[[:punct:]]") %>%
str_replace_all("\n", " "),
text_length = comment_text_clean %>% str_count()
)
Q1 <- quantile(df_$text_length, 0.25)
Q3 <- quantile(df_$text_length, 0.75)
IQR <- Q3 - Q1
upper_fence <- as.integer(Q3 + 1.5 * IQR)
histogram <- df_ %>%
ggplot(aes(x = text_length)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = upper_fence), color = "red", linetype = "dashed", linewidth = 1) +
theme_minimal() +
xlab("Text Length") +
ylab("Frequency") +
xlim(0, max(df_$text_length, upper_fence))
ggplotly(histogram)
Considering the analysis above, I think a good starting value for output_sequence_length is 911, the upper fence of the boxplot, shown as the dashed red vertical line in the last plot. Doing so, we exclude the outliers, which are a small part of our dataset.
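The same Tukey upper-fence computation can be done in Python with numpy percentiles; a sketch on a toy length vector (the real analysis above yields 911):

```python
import numpy as np

# toy character-length vector standing in for the real text_length column
lengths = np.array([4, 91, 150, 196, 250, 419, 800, 5000])

# Tukey upper fence: Q3 + 1.5 * IQR, as drawn on the boxplot
q1, q3 = np.percentile(lengths, [25, 75])
upper_fence = int(q3 + 1.5 * (q3 - q1))
```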
3.3 Dataset
Now we can split the dataset into three parts: train, test, and validation sets. Since sklearn has no function that splits into three sets directly, we can proceed in two steps:
- Split into a train set and a temporary set with a 0.3 split.
- Split the temporary set into two equally sized test and validation sets.
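The resulting proportions are 70/15/15. Since sklearn rounds the test size up with ceil, the sample counts work out as follows (matching train_samples and val_samples in Config):

```python
from math import ceil

total = 159571
temp = ceil(total * 0.3)   # temporary set, held out first
train = total - temp       # remaining 70%
val = ceil(temp * 0.5)     # half of the temporary set
test = temp - val          # the other half
```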
Code
x = df[config.features].values
y = df[config.labels].values
xtrain, xtemp, ytrain, ytemp = train_test_split(
x,
y,
test_size=config.temp_split, # .3
random_state=config.random_state
)
xtest, xval, ytest, yval = train_test_split(
xtemp,
ytemp,
test_size=config.test_split, # .5
random_state=config.random_state
)
The datasets are created with tf.data.Dataset, which builds a data input pipeline. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations. A tf.data.Dataset is an abstraction representing a sequence of elements, in which each element consists of one or more components. Here each dataset is created using from_tensor_slices, which builds a tf.data.Dataset from a (features, labels) tuple. .batch lets us work in batches to improve performance, while .prefetch overlaps the preprocessing and model execution of a training step: while the model is executing training step s, the input pipeline is reading the data for step s+1. Check the documentation for further information.
Code
train_ds = (
tf.data.Dataset
.from_tensor_slices((xtrain, ytrain))
.shuffle(xtrain.shape[0])
.batch(config.batch_size)
.prefetch(tf.data.experimental.AUTOTUNE)
)
test_ds = (
tf.data.Dataset
.from_tensor_slices((xtest, ytest))
.batch(config.batch_size)
.prefetch(tf.data.experimental.AUTOTUNE)
)
val_ds = (
tf.data.Dataset
.from_tensor_slices((xval, yval))
.batch(config.batch_size)
.prefetch(tf.data.experimental.AUTOTUNE)
)
Code
print(
f"train_ds cardinality: {train_ds.cardinality()}\n",
f"val_ds cardinality: {val_ds.cardinality()}\n",
f"test_ds cardinality: {test_ds.cardinality()}\n"
)
train_ds cardinality: 3491
val_ds cardinality: 748
test_ds cardinality: 748
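These cardinalities are simply the number of 32-sample batches in each split, rounded up:

```python
from math import ceil

batch_size = 32
train_batches = ceil(111699 / batch_size)  # train split
val_batches = ceil(23936 / batch_size)     # val and test splits are the same size
```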
Check the first element of the dataset to be sure that the preprocessing is done correctly.
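The output below was produced by pulling one batch from the pipeline; a self-contained sketch of the same inspection on toy data (on the real pipeline the call is simply next(iter(train_ds))):

```python
import numpy as np
import tensorflow as tf

# toy stand-ins for xtrain/ytrain
xtrain = np.array([b"a clean comment", b"an insulting comment"], dtype=object)
ytrain = np.array([[0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 1, 0]])

# same construction as above, on the toy arrays
ds = tf.data.Dataset.from_tensor_slices((xtrain, ytrain)).batch(2)
texts, labels = next(iter(ds))  # one (features, labels) batch
```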
(array([b'Common knowledge in this context, means that it is easy to reference. \xe2\x80\x94 | Talk',
b'I ALREADY ASKED YOU TO LEAVE ME ALONE \n\nIT IS APPARENT THAT I WAS RIGHT YOU SHITFACE LITTLE ASSWAD. CANT LEAVE WELL ENOUGH ALONE. GET THE FUCK AWAY FROM ME, I HAVE NO INTENTIONS OF EVER COMMENTING YOU AGAIN IF YOU JUST LEAVE ME ALONE!',
b'I just asked a few people at NK in Meah Shearim and each confirmed that it was written by his son. | (talk)',
b'"\n\n Ordu page \n\nROOB323, you reverted my edit without reason. If you read the Ordu discussion, I have established that the Balakian source is biased and based upon Dardian\'s biased point of view. The Vakit source dating to 1933 where Dardian cites as a source, also was not corroborated by any other person - academic or none, dead or alive. You are reverting to an edit in discussion and you did not discuss it before or after your edit. That amounts to ""Sneaky Vandalism"" and distruption of an ongoing discussion. Refer to Nishkid64 if you have issues with that. "',
b'" (UTC)\n\nPlease change ""irish-born British"" to ""irish"" or to ""irish who worked mainly in London"" because of the following; The says he was born AND grew up in Ireland apart from a stint in London during WW1. He then went travelling in 1926/1928 to London, Berlin, Paris and back to London. Also, when Francis was born in Ireland under British rule he would be British by his passport and then he would be automatically Irish by 1922 and could later reclaim his British nationality which there is no evidence that he did. So he\'s irish until proven otherwise. Further to that it threatens the credibility of Wikipedia if a survey of people decide he is British because they think so. Someone\'s nationality cannot decided by concensus which is whats happening here. Ignoring the above facts above would be more than disappointing but also from what we KNOW it it is incorrect and an injustice to be so arbitrary. 109.78.211.58 14:05, 11 August 2013"',
b'March 2007 \nPlease do not add nonsense to pages like you did to the Zeus article. If you need someplace to practice editing, please use the sandbox. Thank you.',
b'Give me an answer and I will be satisfied for the next three decades!',
b"South Tibet \n\nI think it's wonderful that you plan to copyedit and improve the article on South Tibet. Please have a look at the discussion page before you edit the article, there's quite a bit of material that I've removed because it wasn't properly sourced, but it could still be useful. \xe2\x80\x94",
b". It's a fairly short article about an American company",
b'"\n\n Re: Loaded (band) \n\nHi, I honored that you thought to ask me. I\'m going to be traveling most of this week so I won\'t get to it until the weekend. You might want start a post at Wikipedia talk:WikiProject Rock music to see if anyone wants to take a look at it in the meantime. Good luck (talk page) "',
b"It is still there. i have tryed to go to the page to delete the rey sabu match but on the edit page it doesn't show up. yet it is still seen on the main page. why is this",
b'And you have done such a great job not helping me figure out how Wikipedia works. If there is no source I figure you would need to source it since that is the base of your whole argument, but you are now telling me you do not need to source it. So as you can see, I have no idea what you are trying to say, and hopefully you explain it instead of doing your usual one vague response and run thank you very much.',
b'"\nOh, yes, of course, whatever we use for this list, we should include a few paragraphs explaining the breadth of possibilities, even down to ""imponderables"" and (perhaps) a brief mention of the really fringe stuff (You know, Berlin Wall, that sort of thing). Basically, make it clear what the breadth is, then present a list of the most common.\nIf we do this right, it should be possible, maybe even easy, to get this to Featured list (WP:FLC) talk "',
b"I learned so much thank to you \n\nI love Alanis and her work since her first song.\n\nI'd like to thank everybody who wrote this because it's really well done.\nI would have helped if it was no so complete...\n\nI will try anyway...\n\nMaryjane",
b'"\n\n Erd\xc5\x91s\xe2\x80\x93Bacon number \n\nIf the article is going to have her Erdos-Bacon number, the article should explain what an Erdos-Bacon number is. I think many readers will be unfamiliar with the concept, which will lead them to be puzzled. I have added in a few sentences explaining what an E-B number is. \xe2\x80\xa2 TALK "',
b'Notable Townsvilleans \n\nNathan Burgers - Goal Keeper for the Australian Mens Field Hockey Team122.109.134.110',
b"J. Whistler\n\nKindly leave my page as is. Rembrandt and Paul (da 'man) C. feel that same way - may we suggest another hobby that would be more suitable for your free time, like hunting for dangling participles? Here's one for a start:\n\n-Swinging wildly through the trees, the children were delighted by the monkeys. \n\nIt's good fun when you get going. Whistlersghost",
b'Welcome!\n\nHello, , and welcome to Wikipedia! Thank you for your contributions. I hope you like the place and decide to stay. Here are some pages that you might find helpful:\nThe five pillars of Wikipedia\nTutorial\nHow to edit a page\nHow to write a great article\nManual of Style\nI hope you enjoy editing here and being a Wikipedian! Please sign your messages on discussion pages using four tildes (~~~~); this will automatically insert your username and the date. If you need help, check out Wikipedia:Questions, ask me on , or ask your question on this page and then place {{helpme}} before the question. Again, welcome! \n\nEdits to New Jersey Devils\nHi! I just wanted to mention that I removed the commentary on games 1 and 2 of this years playoffs from the main New Jersey Devils article, as it goes into a little too much detail for that article. However, the season article, 2008\xe2\x80\x9309 New Jersey Devils season requires a great deal of prose and commentary, and would be far more suitable for edits related to single games of the season. Cheers, lute',
b'"\nTake care! \xe2\x80\x94 \xc2\xb7 [ TALK ] "',
b'"\nYou do realize that a vast body of literature and academic works refers to the places in Ukraine solely by their Russian names, especially in historical contexts? Hard as it may be for you to believe, but an average Western reader would not necessarily know that ""Kharkiv"" and ""Kharkov"" refer to the same place. Including Russian names in the lede helps clarify this point, makes cross-referencing historical literature and Wikipedia articles possible, and generally improves our readers\' experience. Call it the heritage of Russian imperialism or whatever, but the fact remains that including Russian names in these articles is better for readers than not including them. This said, I don\'t see why names in Russian proper should be featured so prominently; surely having transliterated Russian names should be sufficient? At least let\'s tuck Russian proper (and Russian pronunciation) into a footnote.\xe2\x80\x94\xc2\xa0\xe2\x80\xa2\xc2\xa0(yo?); February\xc2\xa06, 2013; 19:45 (UTC)"',
b"Hi. \n\nHey. Can you give me a good thing to research? It's not homework, it's just that I'm bored. Ian Thanks.",
b'I disagree, Sarek. Much like the stories of an eccentric owner complete the narrative, the house falling into disrepair says something about the property. Granted, WP:OSE, but the narrative of the Jacob Kamm House would be incomplete without mentions of the fire and its decline. Similarly, these elements paint the picture of a notable house that has been improperly maintained.',
b'"Starting position(s) ==\nThis might not matter at all but attempting to be correct\nThe photo and diagram showing the start position are actually mirror image of each other\'s position not the same position. I know game play would be exactly the same but chess mirror imaged would also play the same but is generally considered wrong. Edward de Bono\'s web site doesn\'t comment if one is right and the other is wrong but the picture on his site matches the position of the photo here not the diagram; so the pieces look like an ""L"" not a mirror image of one. I don\'t know if it\'s mentioned in the rules with the game of anyone who owns one. If this is so the diagram may need to be flipped. \n\nIt maybe should be mentioned that the mirror image start position is either acceptable, wrong, or simply not mentioned by de Bono but that game play would be the same anyway. \n \n\n== "',
b"Good to see you left \n\nGood. You are a terrible editor. Don't come back. Not even to read an article.",
b'"\n\n Family members on opposite teams \n\nGiants DE Justin Tuck and Pats LB Adalius Thomas are both cousins. Should this be mentioned somewhere? Referenced at bottom, his personal life. EndlessDan "',
b'"\n\nP.S. Just in case you\'ve forgotten your own words of yesterday, here they are: ""You\'ve found a source which shows their sympathies, or professional view, or whatever, is not anti-abortion. It may even establish their position as pro-choice, I\'m not sure - I\'ll have to think that one over."" And now you seem to be saying that I must be some kind of moron for agreeing with what you said yesterday. Please. "',
b"Dear Martin, thanks for your input and contribution to the Biva articles. Wikimedia Commons is an online repository of free-use images. The work of Paul Biva is in the public domain in countries with a copyright term of life of the author plus 100 years or less. I've just added the template PD-art-100 to the Commons images for Paul Biva. There is no reason why the photos you uploaded cannot be used in the French Paul Biva article. The user that removed them for the Paul Biva article was not justified in doing so. They need not be published. My only recommendation is to use good quality images. Pictures from auction sales are allowed. The images you uploaded are in the public domain because they represent works of art painted by a French artist who died more than 100 years ago. I will see if I can find some nice works by P. Biva online, and I will upload them to Commons, then I will proceed to post them in the French Paul Biva article. Let me know what you think. P.S. I would love for you to upload more Paul and Henri Biva works some to Commons.",
b"Iphone eh? What kind of IP address does that have? Anyway, I can wait until you are on your computer or you can just scroll down a bit to the conversation; until then I'll breathlessly wait and compose personal attacks and tendentious arguments. And really Viriditas, why are you still harping on me? I've already made it clear that you've driven me away from wikipedia for good. Do really want me gone now instead of waiting for the ArbCom case to come to a close?",
b'How can i upload image ( logo ) for the page ? If anyone can help me please ?',
b'Hello there i noticed that you have several abusive reverts , on mutliple albanian articles . Removing sourced , valuable , relevant content such as this one . If this persists i will have to contact an administrator . Please do familiarize yourself with the economical statistics of Albania by going here > http://www.financa.gov.al/files/userfiles/Programimi_EkonomikoFiskal/Kuadri_Makroekonomik_dhe_Fiskal/KMF_Periudhen_2014-2016_VKM_NR.73_date_23.01.2013.pdf , source > The ministry of Economics of Albania . ( )',
b'"\n\n Hydrogen ions vs protons \n\n""Hydrogen ion is recommended by the International Union of Pure and Applied Chemistry as a general term for all ions of hydrogen and its isotopes"" - see Hydrogen ion. A Google search reveals that considerably more botanical articles use ions and ion flux in preference to protons. H+ denotes the provenance of the ion as being from Hydrogen, whereas proton is a sort of Deus ex machina with no hint as to where it originated. "',
b"Thank you, the article did need more neutral sources. New reliable, significant and independent sources have been added to this article from third party sources that establish Crouch's role as notable by Wikipedia standards. Crouch is listed as DCEO of HLE on a new source in the Wikipedia article from the Florida Dept. of State/Corporations Division. The United Nations has officially recognized the contributions of Smile of a Child foundation as notable (source in the article from the UN website), and Smile of a Child TV is a national network, and the only one of it's kind.http://en.wikipedia.org/wiki/Smile_of_a_Child_TV. Additionally, Jan Crouch is a widely recognized TV personality which is documented in numerous third party publications (pro or con). Her professional, and broadcasting career is separate and uniquely different than that of Paul Crouch.71.97.55.109"],
dtype=object), array([[0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0],
b'"\n\nSpeedy deletion of Seby torreblanca\n Please do not make personal attacks. Wikipedia has a strict policy against personal attacks. Attack pages and images are not tolerated by Wikipedia and are speedily deleted. Users who continue to create or repost such pages and images in violation of our biographies of living persons policy will be blocked from editing Wikipedia. Thank you. \n\nIf you think that this notice was placed here in error, you may contest the deletion by adding to the top of the page (just below the existing speedy deletion or ""db"" tag), coupled with adding a note on the talk page explaining your position, but be aware that once tagged for speedy deletion, if the article meets the criterion it may be deleted without delay. Please do not remove the speedy deletion tag yourself, but don\'t hesitate to add information to the article that would would render it more in conformance with Wikipedia\'s policies and guidelines. Talk "',
b"No defense for firing? This was a man who was found guilty of sexual harassment by a government investigator, and then later convicted in a court of law. If you have information that is not in the public domain, such as the secret legal opinion obtained by Lorne Calvert's government concerning the firing of Murdoch Carriere, then I would implore you to table that opinion here (or on the internet). Otherwise, all that can be concluded is that it is sheer misconduct and bungling on the part of Calvert and his government that led to the payment of $275,000 plus additional sums relating to pension credits.\n\nTable the legal opinion if you have it, or tell your cronies in the party to convince Calvert to table the opinion under the immunity of Parliamentary Priviledge. Otherwise, your beloved leader and party will continue to fall victim to the criticisms that have been levelled against it, and will not be able to credibly defend itself.",
b'"\n\nU.S. support to Juarez, especially in the form of weapons, came after the end of the American Civil War. They also allowed Juarez\' government to operate for some time from American soil. That doesn\'t seem like ""little help"" to me. e | \xcf\x84\xce\xb1\xce\xbb\xce\xba "',
b'even that security council resolution states that israel has put the golan under civilian law, they just dont recongnize its annexation. i agree with the marbehtorah change it to Territories under ISraeli control209.255.127.242',
b'"Hi. I\'m from the Willamette congregation Eugene, OR. Sorry don\'t know any McGhee\'s in California. I like your ""Opposition"" idea. Cheers, Matt M. ) 03:42, 16 Jun 2005 (UTC)\n\n"',
b'Is Smith married? Gay? What?',
b'Kosovo and Serbia\nSame edit war. Same problem. See Talk:Serbia. 204.52.215.107',
b"They are different volcanoes. I unfortunately haven't got a photo of Hl\xc3\xb6\xc3\xb0ufell to upload but here's a link to one: . I can't seem to find a decent map of Iceland online, but Hl\xc3\xb6\xc3\xb0ufell is southwest of Langj\xc3\xb6kull and Her\xc3\xb0ubrei\xc3\xb0 is north of Vatnaj\xc3\xb6kull (just northeast of Askja). I would think that Her\xc3\xb0ubrei\xc3\xb0 is considered Iceland's most famous tuya and I haven't heard any such sentiment about Hl\xc3\xb6\xc3\xb0ufell. Neither of the names mean tuya, but I think the best translation of tuya might be B\xc3\xbarfell (which name many Icelandic mountains possess). Best regards,",
b'Ogstrokes and 24.239.149.9 are the same person. 24.239.149.9 was before registration, and OGstrokes is after registration.',
b'At Bookfinder.com I found Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference\n\n Softcover, ISBN: 1558604790 Publisher: Morgan Kaufmann Pub, 1988\n\n Bookfinder',
b'It was not an attack on someone, she is a personal friend of mine and i was doing it as a temporary gag. As soon as i was told it was going to be deleted, i deleted the contents accordingly. yes the page should have been deleted but it was not an attack on anyone.',
b'"\nI countered your complaint about ""operation iraqi freedom"" with the fact that it is not even the article title and thus changing the link to a neutral term in no way shows bias. I removed ""just cause"" just because it is a propaganda term. "',
b'"\nThen I will just request you to please upload the image. It is the most appropriate image for the article. \'\'\'\'\'\' "',
b'TUSC token eaf3126a18fae1f097f990290c84672a \n\nI am now proud owner of a TUSC account!',
b"Yes Bollyjeff, that is my thought, its not her debut film. If you go around web, you'll find all fancruft sites saying 2002,2003 but, you can check album details of film in allnusic, it says 2003. Music was released in2003 but film was delayed, much like Ishq In Paris whose music was released but now it seems forgettable.You got my point what I'm saying.",
b'"\nHere\'s a link you may find helpful. Wikipedia:Mediation. Charge! "',
b"This album is very similar to Clayman in all aspects, except perhaps it have a softer approach. So why don't leave just melodeath and hard rock",
b'"Based on reports in the media this week, and from comments made by others who have experienced what I went through, it\'s apparently not uncommon in Wikipedia for a single person to create several different personalities, or ID\'s, and to then use them themselves, or in concert with others, to create the appearance of unrelated people making edits, challenging posters, harrassing and threatening others, and worse. So, your statment that ""..hmm, no, it was three people."", rings hollow with some of us.\n \n\n"',
b'RfA thanks \n\nThank you very much for your support at my RfA. Regards, (talk)',
b'"\n\n Shouldn\'t byte range locking be mentioned for the Windows platform? \n\nFor Unix locking, it is mentioned that ""different kinds of locks may be applied to different sections (byte ranges) of a file"". This is not mentioned for the Windows platform. \n\nHowever, when I checkout the LockFile function in msdn, it seems to me that this is also possible on the windows platform, though only for ""server"" editions of the OS. I\'m not a file-locking specialist, so I\'m hesitating a bit to edit the article..."',
b'Myspace worm\n\nDo we really need to mention the FSM Myspace worm? It seems awfully unrelated to the actual Pastafarianism concept. -',
b'http://indianmilitarynews.files.wordpress.com/2011/05/indian-army-rape-us.jpg\nIndian kids trying to rape my user page ( 86.182.174.123',
b'"\nGo spill your philosophies elsewhere. This isn\'t a church. (mailbox) "',
b"No, the reference is talking about Total Shakira's album sales in US, is not Sale el Sol WW sales. Nielsen SoundScan only counts US sales, only IFPI give WW sales.",
b"Followup: Ten edits from this IP (to date) include seven that were quickly reverted by other editors as worthless. Five of those seven (plus one more constructive edit) relate to emergency medicine, and four of the five are bad jokes/deletable nonsense involving Dunkin' Donuts, dating back more than a year (to edits of Paramedic on 30 March 2007). The difference between running gag and repeat vandalism is subtle, but nonetheless real: It's comedy shows that have running gags, whereas freely-editable encyclopedias have repeat vandalism. If we have any more trouble with this, I believe the evidence would get us an edit ban.",
b"Now who's all up in arms about cultural minutiae. Face it bud, you didn't do you homework. Nor did you discuss on the talk page before you slapped a tag on it. I'm sure you can find some sort of wikijustification for your poor research. Just keep to stuff you know from here on out. Otherwise you'll get your ass handed to you over and over again. 71.102.2.128"],
dtype=object), array([[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0]]))
And we also check the shape. We expect a feature of shape (batch,) and a target of shape (batch, number of labels).
Code
print(
    f"text train shape: {train_ds.as_numpy_iterator().next()[0].shape}\n",
    f" text train type: {train_ds.as_numpy_iterator().next()[0].dtype}\n",
    f"label train shape: {train_ds.as_numpy_iterator().next()[1].shape}\n",
    f"label train type: {train_ds.as_numpy_iterator().next()[1].dtype}\n"
)
text train shape: (32,)
text train type: object
label train shape: (32, 6)
label train type: int64
4 Preprocessing
Of course, preprocessing! Text is not the kind of input a neural network can handle directly. The TextVectorization layer is meant to handle natural language inputs. Each example is processed through the following steps:
1. Standardize each example (usually lowercasing + punctuation stripping).
2. Split each example into substrings (usually words).
3. Recombine substrings into tokens (usually ngrams).
4. Index tokens (associate a unique integer value with each token).
5. Transform each example using this index, either into a vector of ints or a dense float vector.
For more reference, see the documentation at the following link.
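Before looking at the Keras layer itself, the five steps can be sketched in plain Python. This is a toy illustration of the idea, not the actual Keras implementation; step 3 (ngram recombination) is skipped since we split on single words, and the reserved indices 0 (padding/masking) and 1 (out-of-vocabulary) mirror the layer's "int" mode:

```python
import re

def standardize(text):
    # 1. lowercase + strip punctuation
    return re.sub(r"[^\w\s]", "", text.lower())

def split(text):
    # 2. split into substrings (words)
    return text.split()

def build_index(corpus, max_tokens=20000):
    # 4. index tokens by frequency; 0 is reserved for padding/masking,
    #    1 for out-of-vocabulary, hence max_tokens - 2 real entries
    counts = {}
    for text in corpus:
        for tok in split(standardize(text)):
            counts[tok] = counts.get(tok, 0) + 1
    ordered = sorted(counts, key=counts.get, reverse=True)[: max_tokens - 2]
    return {tok: i + 2 for i, tok in enumerate(ordered)}

def vectorize(text, index, output_sequence_length=8):
    # 5. transform into a fixed-length vector of ints (pad with 0)
    ids = [index.get(tok, 1) for tok in split(standardize(text))]
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))

corpus = ["You are great!", "You are toxic."]
index = build_index(corpus)
print(vectorize("you are really toxic", index))  # "really" is OOV -> 1
```

The unseen word "really" maps to the OOV index 1, and the sequence is right-padded with 0 up to the configured length, just as the Keras layer does in "int" mode.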
Code
text_vectorization = TextVectorization(
    max_tokens=config.max_tokens,
    standardize="lower_and_strip_punctuation",
    split="whitespace",
    output_mode="int",
    output_sequence_length=config.output_sequence_length,
    pad_to_max_tokens=True
)
# prepare a dataset that only yields raw text inputs (no labels)
text_train_ds = train_ds.map(lambda x, y: x)
# adapt the text vectorization layer to the text data to index the dataset vocabulary
text_vectorization.adapt(text_train_ds)
This layer is configured as follows:
- max_tokens: 20000, a common vocabulary size for text classification; it is the maximum size of the vocabulary for this layer.
- output_sequence_length: 911 (see Figure 3 for the rationale). Only valid in "int" mode.
- output_mode: "int" outputs integer indices, one integer index per split string token. When output_mode == "int", 0 is reserved for masked locations; this reduces the vocabulary size to max_tokens - 2 instead of max_tokens - 1.
- standardize: "lower_and_strip_punctuation".
- split: on whitespace.
To preserve the original comments as text while also having tf.data.Dataset objects in which the text is preprocessed by TextVectorization, we can map the layer over the features of each dataset.
Code
processed_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)
processed_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)
processed_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.experimental.AUTOTUNE
)
5 Model
5.1 Definition
Define the model using the Functional API.
Code
from keras.layers import LayerNormalization  # used below but missing from the imports above

def get_deeper_lstm_model():
    clear_session()
    inputs = Input(shape=(None,), dtype=tf.int64, name="inputs")
    embedding = Embedding(
        input_dim=config.max_tokens,
        output_dim=config.embedding_dim,
        mask_zero=True,
        name="embedding"
    )(inputs)
    x = Bidirectional(LSTM(256, return_sequences=True, name="bilstm_1"))(embedding)
    x = Bidirectional(LSTM(128, return_sequences=True, name="bilstm_2"))(x)
    # Global average pooling
    x = GlobalAveragePooling1D()(x)
    # Add regularization
    x = Dropout(0.3)(x)
    x = Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = LayerNormalization()(x)
    outputs = Dense(len(config.labels), activation='sigmoid', name="outputs")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer='adam', loss="binary_crossentropy", metrics=config.metrics, steps_per_execution=32)
    return model

lstm_model = get_deeper_lstm_model()
lstm_model.summary()
5.2 Callbacks
Finally, the model has been trained using 2 callbacks:
- Early Stopping, to avoid consuming the Kaggle GPU quota on epochs that no longer improve.
- Model Checkpoint, to retrieve the best model weights from training.
5.3 Final preparation before fit
Since the dataset is imbalanced, we compute class weights to improve performance; these are passed to the model during training.
| label | class_weight |
|---|---|
| toxic | 9.59% |
| severe_toxic | 0.99% |
| obscene | 5.28% |
| threat | 0.31% |
| insult | 4.91% |
| identity_hate | 0.87% |
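The exact weighting formula is not shown in this section; a common choice is an inverse-frequency weight for each label's positive class, in the spirit of scikit-learn's "balanced" heuristic. A minimal sketch, with hypothetical per-label positive counts chosen only to echo the prevalences above:

```python
def balanced_weight(n_samples, n_positive):
    # inverse-frequency weight for the positive class of one binary label:
    # rare labels get proportionally larger weights
    return n_samples / (2 * n_positive)

# hypothetical counts of positive examples out of 100,000 comments
counts = {"toxic": 9600, "severe_toxic": 990, "obscene": 5280,
          "threat": 310, "insult": 4910, "identity_hate": 870}
weights = {label: balanced_weight(100_000, c) for label, c in counts.items()}
print(weights)
```

The rarest label (threat) ends up with the largest weight, so its few positive examples contribute more to the loss during training.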
It is also useful to define the steps per epoch for the train and validation datasets. This step is required to ensure the dataset is consumed entirely during the fit, which was an issue I ran into.
5.4 Fit
The fit has been done on Kaggle to leverage the GPU. Some considerations about the fit:
- .repeat() ensures the model can see the whole dataset across epochs.
- epochs is set to 100.
- validation_data is repeated in the same way.
- callbacks are the ones defined before.
- class_weight ensures the model is trained accounting for the frequency of each class, because our dataset is imbalanced.
- steps_per_epoch and validation_steps depend on the use of repeat.
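Because .repeat() makes the dataset yield batches indefinitely, steps_per_epoch and validation_steps are what tell Keras where each epoch ends. The arithmetic is a ceiling division; the sample counts below are hypothetical, while the batch size of 32 matches the shapes printed earlier:

```python
import math

def steps_for(n_samples, batch_size):
    # one epoch = enough batches to cover every sample exactly once
    return math.ceil(n_samples / batch_size)

# hypothetical split sizes with the batch size of 32 used above
steps_per_epoch = steps_for(n_samples=127_656, batch_size=32)
validation_steps = steps_for(n_samples=15_957, batch_size=32)
print(steps_per_epoch, validation_steps)  # 3990 499
```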
Now we can import the model and the history trained on Kaggle.
5.5 Evaluate
Code
val_metrics <- tibble(
  metric = c("loss", "precision", "recall", "auc", "f1_score"),
  value = py$validation
)
val_metrics %>%
  gt() %>%
  fmt_number(
    columns = c("value"),
    decimals = 4,
    drop_trailing_zeros = TRUE,
    drop_trailing_dec_mark = TRUE
  ) %>%
  cols_align(
    align = "left",
    columns = metric
  ) %>%
  cols_align(
    align = "center",
    columns = value
  ) %>%
  cols_label(
    metric = "Metric",
    value = "Value"
  )
| Metric | Value |
|---|---|
| loss | 0.0542 |
| precision | 0.7888 |
| recall | 0.671 |
| auc | 0.9572 |
| f1_score | 0.0293 |
5.6 Predict
For the prediction, the dataset does not need to be repeated: the model has already been trained on all of the train data, and now it only has to consume the new data once to make the predictions.
5.7 Confusion Matrix
The best way to assess the performance of a multi-label classifier is a confusion matrix. Sklearn has a specific function, multilabel_confusion_matrix, to handle the fact that there can be multiple labels for one prediction.
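multilabel_confusion_matrix produces one 2x2 one-vs-rest matrix per label column. A plain-Python sketch of the same computation, using sklearn's [[TN, FP], [FN, TP]] layout:

```python
def per_label_confusion(y_true, y_pred):
    """One [[TN, FP], [FN, TP]] matrix per label column,
    mirroring sklearn's multilabel_confusion_matrix layout."""
    n_labels = len(y_true[0])
    matrices = []
    for j in range(n_labels):
        tn = fp = fn = tp = 0
        for t, p in zip(y_true, y_pred):
            if t[j] == 1 and p[j] == 1: tp += 1
            elif t[j] == 0 and p[j] == 1: fp += 1
            elif t[j] == 1 and p[j] == 0: fn += 1
            else: tn += 1
        matrices.append([[tn, fp], [fn, tp]])
    return matrices

y_true = [[1, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [1, 0]]
print(per_label_confusion(y_true, y_pred))
```

Each label is scored independently against its own column, which is exactly why the per-label matrices below can disagree on which comments count as errors.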
5.7.1 Grid Search Cross Validation for best threshold
Grid Search CV is a technique for fine-tuning the hyperparameters of a ML model. It systematically searches through a set of hyperparameter values to find the combination that leads to the best model performance. In this case I combine it with KFold Cross Validation, a resampling technique that splits the data into k consecutive folds: each fold is used once as validation while the k - 1 remaining folds form the training set. See the documentation for more information.
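KFold itself just partitions indices; a minimal sketch of the splitting logic, mirroring sklearn's KFold with shuffle=False (the first n_samples % k folds get one extra sample):

```python
def kfold_indices(n_samples, k):
    # split range(n_samples) into k consecutive folds;
    # each fold serves once as validation, the rest as training
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, val))
        start += size
    return splits

for train_idx, val_idx in kfold_indices(6, 3):
    print(train_idx, val_idx)
```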
The threshold search optimizes recall. The decision was made because the cost of missing a True Positive is greater than that of a False Positive: missing an injurious observation is worse than classifying a clean one as bad.
5.7.2 Confidence threshold and Precision-Recall trade off
Whilst the KFold Grid Search CV technique is useful to test multiple hyperparameters, it is important to understand the problem we are facing. A multi-label deep learning classifier outputs a vector of per-class probabilities. These need to be converted to a binary vector using a confidence threshold.
- The higher the threshold, the fewer classes the model predicts, increasing model confidence [higher Precision] but increasing missed classes [lower Recall].
- The lower the threshold, the more classes the model predicts, decreasing model confidence [lower Precision] but decreasing missed classes [higher Recall].
Threshold selection means we have to decide which metric to prioritize, based on the problem we are facing and the relative cost of misjudging. We can consider toxic comment filtering a problem similar to cancer diagnostics: it is better to predict cancer in people who do not have it [False Positive] and perform further analysis than to miss the disease in a patient who has it [False Negative].
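The trade-off is easy to verify on a toy example with made-up scores: lowering the threshold can only raise recall, and here it lowers precision:

```python
def precision_recall(y_true, scores, threshold):
    # binarize scores at the threshold, then compute both metrics
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.2, 0.4, 0.1, 0.05]

print(precision_recall(y_true, scores, 0.5))   # strict threshold
print(precision_recall(y_true, scores, 0.15))  # permissive threshold
```

At 0.5 the classifier is cautious (precision 1.0, recall 2/3); at 0.15 it flags every true positive at the cost of one false alarm (precision 0.75, recall 1.0).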
I decided to train the model on the F1 score to obtain a model balanced in both precision and recall, and to leave it to the threshold selection to increase the recall performance.
Moreover, the model has been trained on the macro average F1 score, a single performance indicator obtained as the mean of the F1 scores of the individual classes.
\[ F1\ macro\ avg = \frac{\sum_{i=1}^{n} F1_i}{n} \]
It is useful with imbalanced classes because it weights each class equally: it is not influenced by the number of samples in each class. This is set both in config.metrics and find_optimal_threshold_cv.
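A quick sketch of the macro average, using the per-class precision/recall values reported in the classification report of section 5.8; each class contributes equally regardless of its support:

```python
def f1(precision, recall):
    # harmonic mean of precision and recall for one class
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# per-class (precision, recall) pairs from the classification report below
per_class = [(0.55, 0.89), (0.24, 0.92), (0.55, 0.94),
             (0.04, 0.49), (0.47, 0.91), (0.12, 0.72)]

f1_scores = [f1(p, r) for p, r in per_class]
macro_f1 = sum(f1_scores) / len(f1_scores)
print(round(macro_f1, 2))  # 0.44, matching the report's macro avg
```

Note how the tiny threat class (F1 around 0.07) drags the macro average down as much as any large class would, which is exactly the behavior we want under imbalance.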
f1_score
Code
Optimal threshold: 0.15000000000000002
Best score: 0.4788653077945807
Optimal threshold f1 score: 0.15. Best score: 0.4788653.
recall_score
Code
ytrue = ytest.astype(int)
y_pred_proba = predictions
optimal_threshold_recall, best_score_recall = config.find_optimal_threshold_cv(ytrue, y_pred_proba, recall_score)
# Use the optimal threshold to make predictions
final_predictions_recall = (y_pred_proba >= optimal_threshold_recall).astype(int)
Optimal threshold recall: 0.05. Best score: 0.8095814.
roc_auc_score
Code
Optimal threshold: 0.05
Best score: 0.8809499649742268
Optimal threshold roc: 0.05. Best score: 0.88095.
5.7.3 Confusion Matrix Plot
Code
# convert probability predictions to binary predictions
ypred = predictions >= optimal_threshold_recall  # .05
ypred = ypred.astype(int)

# create a plot with 3 by 2 subplots
fig, axes = plt.subplots(3, 2, figsize=(15, 15))
axes = axes.flatten()
mcm = multilabel_confusion_matrix(ytrue, ypred)

# plot the confusion matrices for each label
for i, (cm, label) in enumerate(zip(mcm, config.labels)):
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(ax=axes[i], colorbar=False)
    axes[i].set_title(f"Confusion matrix for label: {label}")
plt.tight_layout()
plt.show()
5.8 Classification Report
Code
library(reticulate)
df_cr <- py$df_cr %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>%
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>%
  pivot_wider(
    names_from = names,
    values_from = values
  ) %>%
  gt() %>%
  tab_header(
    title = "Confusion Matrix",
    subtitle = "Threshold optimization favoring recall"
  ) %>%
  fmt_number(
    columns = c("precision", "recall", "f1-score", "support"),
    decimals = 2,
    drop_trailing_zeros = TRUE,
    drop_trailing_dec_mark = FALSE
  ) %>%
  cols_align(
    align = "center",
    columns = c("precision", "recall", "f1-score", "support")
  ) %>%
  cols_align(
    align = "left",
    columns = metrics
  ) %>%
  cols_label(
    metrics = "Metrics",
    precision = "Precision",
    recall = "Recall",
    `f1-score` = "F1-Score",
    support = "Support"
  )
| Confusion Matrix | ||||
|---|---|---|---|---|
| Threshold optimization favoring recall | ||||
| Metrics | Precision | Recall | F1-Score | Support |
| toxic | 0.55 | 0.89 | 0.68 | 2,262. |
| severe_toxic | 0.24 | 0.92 | 0.37 | 240. |
| obscene | 0.55 | 0.94 | 0.69 | 1,263. |
| threat | 0.04 | 0.49 | 0.07 | 69. |
| insult | 0.47 | 0.91 | 0.62 | 1,170. |
| identity_hate | 0.12 | 0.72 | 0.2 | 207. |
| micro avg | 0.42 | 0.9 | 0.57 | 5,211. |
| macro avg | 0.33 | 0.81 | 0.44 | 5,211. |
| weighted avg | 0.49 | 0.9 | 0.63 | 5,211. |
| samples avg | 0.05 | 0.08 | 0.06 | 5,211. |
6 Conclusions
The BiLSTM model, optimized for high recall, performs well enough to make predictions for each label. Considering the low support for the threat label, the performance is not bad. See Table 2 and Figure 1: the threat label accounts for only 0.27% of the observations. The model has been optimized for recall because the cost of not identifying an injurious comment as such is higher than the cost of considering a clean comment as injurious.
Possible improvements could be to increase the number of observations, especially for the threat label. In general there are too many clean comments. This could be addressed by undersampling the clean comments, which I explicitly avoided in order to check the performance of the BiLSTM on an imbalanced dataset, leveraging the class weight method.